Predicting Spotify track popularity based on certain audio features

Group 27
2021-11-25 (updated: 2021-12-11)

Summary

We use Ridge regression to build a model that predicts the popularity of Spotify tracks from audio features such as danceability, loudness, and tempo. The popularity score ranges from 0 to 100: a score of 0 means the song has minimal popularity, and a score of 100 means the song is extremely popular.

Introduction

Some songs sit atop popularity charts such as the Billboard charts, others sit comfortably at the bottom, and some don’t chart at all. This raises an interesting question: what exactly makes a song popular, and can we predict how popular a song will become? Here we attempt to answer that question by predicting a song’s popularity from certain features.

According to this report, approximately 137 million new songs are released every year, and only about 14 records have sold 15 million physical copies or more in global history. It is therefore valuable to understand what determines a track’s popularity and, specifically, to predict how popular a song will become based on features such as danceability, loudness, and tempo.

Methods

Data

The dataset used in this project was sourced from Tidy Tuesday’s GitHub repo here, and particularly here. The data originally come from Data.World, Billboard.com and Spotify. Each row of the dataset holds one song’s features, plus a target column giving the song’s popularity on a scale from 0 (least popular) to 100 (most popular).

Analysis

A Ridge regression model was built to answer our research question, namely predicting the popularity of Spotify tracks. This is a regression problem, and the target ranges from 0 (least popular) to 100 (most popular). All features in the original dataset were used to fit the model except ‘song_id’, ‘spotify_track_id’, and ‘spotify_track_album’. Hyperparameters were optimized using 10-fold cross-validation. The code used to perform this analysis can be found here.
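The setup described above can be sketched as follows. This is a minimal illustration, not the project's actual script: the data frame is synthetic, the target column name `popularity` and the helper `build_ridge_pipeline` are hypothetical, and only numeric features are handled here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Identifier columns named in the text; they are excluded before fitting.
DROP_COLS = ["song_id", "spotify_track_id", "spotify_track_album"]


def build_ridge_pipeline(numeric_cols, alpha=1.0):
    """Hypothetical helper: scale numeric features, then fit Ridge."""
    preprocessor = make_column_transformer((StandardScaler(), numeric_cols))
    return make_pipeline(preprocessor, Ridge(alpha=alpha))


# Synthetic stand-in for the real dataset (all values are made up).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "song_id": [f"song{i}" for i in range(n)],
    "spotify_track_id": [f"id{i}" for i in range(n)],
    "spotify_track_album": [f"album{i % 20}" for i in range(n)],
    "danceability": rng.uniform(0, 1, n),
    "loudness": rng.uniform(-30, 0, n),
    "tempo": rng.uniform(60, 200, n),
})
# Assumed target column, kept inside the 0-100 popularity scale.
df["popularity"] = np.clip(
    50 + 30 * df["danceability"] + 0.5 * df["loudness"] + rng.normal(0, 5, n),
    0, 100,
)

X = df.drop(columns=DROP_COLS + ["popularity"])
y = df["popularity"]
numeric_cols = ["danceability", "loudness", "tempo"]

model = build_ridge_pipeline(numeric_cols).fit(X, y)
train_r2 = model.score(X, y)  # R2 on the training data
print(f"train R2: {train_r2:.3f}")
```

Dropping the identifier columns before the split keeps them out of every cross-validation fold, and wrapping the scaler and the regressor in one pipeline means the scaling is re-fitted on each training fold only.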

Results & Discussion

It is important to look at how the features are correlated and what their pairwise distributions look like. In Figure 1, the blue plots (each with a fitted line) show the paired distributions of the features, and the other boxes show the pairwise correlations. The correlations appear moderate, so the features can be used together to build the Ridge model that answers the predictive question.

Figure 1. Pairwise distributions and correlations of all features
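A correlation check of this kind can be sketched with pandas. The example below uses synthetic data in which `loudness` is made to depend on `danceability` purely for illustration; the 0.8 threshold is an arbitrary assumption, not a value from the analysis.

```python
import numpy as np
import pandas as pd

# Synthetic features (made up); loudness is correlated with danceability
# by construction so that the check has something to find.
rng = np.random.default_rng(1)
n = 500
danceability = rng.uniform(0, 1, n)
features = pd.DataFrame({
    "danceability": danceability,
    "loudness": -30 + 25 * danceability + rng.normal(0, 3, n),
    "tempo": rng.uniform(60, 200, n),
})

# Pairwise Pearson correlations between all feature columns.
corr = features.corr(method="pearson")

# List each feature pair once if its absolute correlation is above a threshold.
threshold = 0.8  # assumed cut-off for "strongly correlated"
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(corr.round(2))
print(high_pairs)
```

Pairs flagged this way are the ones where an unregularized linear model would have unstable coefficients, which is the motivation for a regularized model.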

We adopted a regularized linear regression model, Ridge. We chose Ridge because its L2 regularization mitigates the multicollinearity problem. A 10-fold cross-validation was carried out, and the train and validation R2 scores are reported in Table 1.
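To make the multicollinearity remark concrete, the standard Ridge objective and its closed-form solution can be written as follows. The notation is chosen here for illustration: X is the feature matrix, y the popularity vector, w the coefficient vector, and α ≥ 0 the regularization strength; the intercept is omitted for brevity.

```latex
\hat{w}
  = \arg\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
  = \left( X^\top X + \alpha I \right)^{-1} X^\top y
```

When features are nearly collinear, X^T X is close to singular and the ordinary least-squares solution becomes unstable. Adding αI with α > 0 keeps the matrix invertible and shrinks the coefficients, at the cost of some bias.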

Table 1. Train and validation scores from cross-validation
fit_time   score_time  test_score  train_score
0.6432571  0.0582726   0.4782156   0.7925252
0.6447966  0.0585334   0.4740333   0.7918603
0.6070879  0.0596128   0.5033566   0.7890991
0.6175146  0.0581553   0.4766368   0.7928274
0.6207914  0.0589225   0.4459885   0.7938532
0.6111379  0.0589015   0.4791355   0.7909243
0.6139162  0.0656796   0.4707675   0.7920449
0.6158381  0.0594504   0.4846590   0.7931858
0.6065817  0.0584295   0.4647415   0.7921520
0.6169782  0.0587053   0.5039069   0.7905965
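The column names in Table 1 match what scikit-learn's `cross_validate` returns when train scores are requested. A minimal sketch of producing such a table, using synthetic data and an assumed `alpha=1.0` rather than the project's real pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic regression data (made up for illustration).
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=1.0, size=300)

# 10-fold cross-validation, scoring each fold with R2 on both the
# training part and the held-out validation part.
scores = cross_validate(
    Ridge(alpha=1.0), X, y, cv=10, scoring="r2", return_train_score=True
)

cv_table = pd.DataFrame(scores)  # one row per fold
mean_scores = cv_table[["test_score", "train_score"]].mean()
print(cv_table)
print(mean_scores)
```

Comparing the mean of `test_score` with the mean of `train_score` is a quick way to see how much of the training fit carries over to unseen folds.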

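The following table shows the ten candidates evaluated by RandomizedSearchCV, sorted by mean validation R2, which were used to determine the best hyperparameters for the Ridge model. The best candidate uses alpha = 1 and reaches a mean validation R2 of about 0.50.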

Table 2. Best hyperparameters from RandomizedSearchCV
mean_test_score param_ridge__alpha param_columntransformer__countvectorizer-1__max_features param_columntransformer__countvectorizer-1__binary param_columntransformer__countvectorizer-2__max_features param_columntransformer__countvectorizer-2__binary
0.4971355 1e+00 1000 TRUE 1000 FALSE
0.4943319 1e+00 1000 FALSE 1000 FALSE
0.4933197 1e+01 1000 TRUE 1000 FALSE
0.4552179 1e-01 1000 TRUE 1000 FALSE
0.4523615 1e-01 1000 FALSE 1000 FALSE
0.4517998 1e-01 1000 FALSE 1000 TRUE
0.4408479 1e+02 1000 TRUE 1000 FALSE
0.4402483 1e+02 1000 FALSE 1000 FALSE
0.4384227 1e-03 1000 TRUE 1000 TRUE
0.4363332 1e-03 1000 FALSE 1000 FALSE
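The parameter names in Table 2 suggest a pipeline built with `make_pipeline` and `make_column_transformer` containing two `CountVectorizer` steps. The sketch below reproduces that naming on synthetic data; the text column names `song_name` and `performer`, the vocabulary, and the search ranges are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with two text columns and one numeric column (made up).
rng = np.random.default_rng(3)
n = 120
words = ["love", "night", "dance", "blue", "heart", "road"]
df = pd.DataFrame({
    "song_name": [" ".join(rng.choice(words, 3)) for _ in range(n)],
    "performer": [" ".join(rng.choice(words, 2)) for _ in range(n)],
    "tempo": rng.uniform(60, 200, n),
})
y = 0.2 * df["tempo"] + rng.normal(0, 5, n)

# Duplicate transformer types are auto-named with "-1", "-2" suffixes.
preprocessor = make_column_transformer(
    (StandardScaler(), ["tempo"]),
    (CountVectorizer(), "song_name"),  # named "countvectorizer-1"
    (CountVectorizer(), "performer"),  # named "countvectorizer-2"
)
pipe = make_pipeline(preprocessor, Ridge())

alphas = [1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2]
param_dist = {
    "ridge__alpha": alphas,
    "columntransformer__countvectorizer-1__max_features": [1000],
    "columntransformer__countvectorizer-1__binary": [True, False],
    "columntransformer__countvectorizer-2__max_features": [1000],
    "columntransformer__countvectorizer-2__binary": [True, False],
}

# Sample 10 candidates and score each with 10-fold cross-validated R2.
search = RandomizedSearchCV(
    pipe, param_dist, n_iter=10, cv=10, scoring="r2", random_state=0
)
search.fit(df, y)

results = pd.DataFrame(search.cv_results_).sort_values(
    "mean_test_score", ascending=False
)
print(results[["mean_test_score", "param_ridge__alpha"]])
```

Sorting `cv_results_` by `mean_test_score` gives a table with the same layout as Table 2, and `search.best_params_` holds the top row's settings.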

To evaluate the performance of our model, we compared predicted values with actual values, as plotted below. The fit is moderate: in Table 1 the validation R2 averages about 0.48 against about 0.79 on the training folds, which points to some overfitting but suggests that the Ridge model is a viable baseline.

Figure 2. Comparison of actual vs. predicted values
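A comparison of this kind can also be summarized numerically. The sketch below fits Ridge on synthetic data, predicts on a held-out split, and reports R2 and RMSE. Clipping predictions to the 0-100 range is an optional step assumed here because Ridge output is unbounded; the source does not state whether the analysis did this.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic features and a 0-100 target (made up for illustration).
rng = np.random.default_rng(4)
X = rng.normal(size=(400, 4))
y = np.clip(
    50 + X @ np.array([10.0, -6.0, 3.0, 0.0]) + rng.normal(scale=8.0, size=400),
    0, 100,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = Ridge(alpha=1.0).fit(X_train, y_train)

# Assumed post-processing: keep predictions inside the popularity scale.
y_pred = np.clip(model.predict(X_test), 0, 100)

r2 = r2_score(y_test, y_pred)
rmse = float(np.sqrt(np.mean((y_test - y_pred) ** 2)))
print(f"held-out R2: {r2:.3f}, RMSE: {rmse:.2f}")
```

Plotting `y_test` against `y_pred` with a diagonal reference line would give a figure analogous to Figure 2; points far from the diagonal are the tracks the model predicts poorly.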

To improve this model to the point where its predictions are highly reliable, we will need the right combination of data. The data used here are mostly Spotify- and Billboard-based; in the future, we’ll look at aggregating data from other sources as well. Also, the Ridge model did not perform especially well, so we’ll look into more sophisticated feature engineering and model training.
